Metabolomics Filtering Analysis

Summary

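Metabolomics data were filtered to remove noise and retain the peaks that carry most of the signal. Three approaches were compared: an 80% cumulative threshold, a 90% cumulative threshold, and a ≥0.01% minimum-contribution threshold.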

Original Data

23,134 Peaks
unique peaks before filtering

After 80% Filter

3,740 Peaks
16.2% of original kept
Key Finding: Root tissue reaches the 80% threshold with far fewer peaks (average ~345) than leaf tissue (average ~923), roughly 2.7 times fewer. Root signal is therefore concentrated in a smaller set of dominant peaks, while leaf signal is spread across many more.

Source Data

Leaf vs Root

Root samples have more concentrated signal - fewer peaks make up 80% of the total.

Leaf Avg

923
peaks to reach 80%

Root Avg

345
peaks to reach 80%

Overlap Between Methods

Comparison                  Count
In both 90% and ≥0.01%      5,819
Only in 90%                   958
Only in ≥0.01%                389
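
As a consistency check, the three overlap counts should add up to the totals reported for each method (6,777 peaks for 90% and 6,208 peaks for ≥0.01%). The sketch below assumes each method's result is available as a Python set of peak names; the helper name overlap_counts is hypothetical and not part of the original analysis.

```python
def overlap_counts(set_a, set_b):
    """Count items shared by two sets and items unique to each one."""
    return {
        "both": len(set_a & set_b),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
    }


# Counts reported in the overlap table.
both, only_90, only_min = 5819, 958, 389

total_90 = both + only_90              # peaks kept by the 90% cumulative filter
total_min = both + only_min            # peaks kept by the >=0.01% filter
kept_by_either = both + only_90 + only_min

print(total_90, total_min, kept_by_either)
```

The two totals match the peak counts reported for the 90% and ≥0.01% filters, so the overlap table is internally consistent.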

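What this means: The two methods largely agree. 5,819 peaks pass both filters, which is about 86% of the 90% set (6,777 peaks) and about 94% of the ≥0.01% set (6,208 peaks). Another 958 peaks are kept only by the 90% filter and 389 only by the ≥0.01% filter.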

80% Threshold Results - All 41 Tissue Samples

Each row shows one tissue sample and how many peaks were needed to account for 80% of its total signal. Samples with fewer peaks needed have more concentrated signal (dominated by a few compounds).

Leaf Tissue (21 total)

#    ID   Peaks Needed   Smallest Peak Kept
1    AL   1,317          0.0118%
2    BL   547            0.0258%
3    CL   692            0.0217%
4    DL   1,317          0.0121%
5    EL   664            0.0216%
6    FL   1,303          0.0124%
7    GL   756            0.0204%
8    HL   1,289          0.0124%
9    IL   692            0.0205%
10   JL   568            0.0248%
11   KL   1,217          0.0129%
12   LL   676            0.0218%
13   ML   598            0.0245%
14   NL   1,314          0.0123%
15   OL   747            0.0206%
16   PL   661            0.0225%
17   QL   1,376          0.0114%
18   RL   614            0.0230%
19   SL   1,378          0.0117%
20   TL   519            0.0271%
21   UL   1,143          0.0131%

Root Tissue (20 total)

#    ID   Peaks Needed   Smallest Peak Kept
1    AR   442            0.0332%
2    BR   276            0.0445%
3    CR   368            0.0368%
4    DR   448            0.0338%
5    ER   515            0.0264%
6    FR   334            0.0360%
7    GR   433            0.0325%
8    HR   172            0.0678%
9    IR   372            0.0392%
10   JR   478            0.0297%
11   KR   264            0.0446%
12   LR   244            0.0473%
13   MR   496            0.0304%
14   NR   302            0.0441%
15   PR   225            0.0469%
16   QR   126            0.0706%
17   RR   428            0.0341%
18   SR   400            0.0413%
19   TR   426            0.0339%
20   UR   154            0.0727%

How to Read This Table

Example: AL (Leaf tissue from tree A) needed 1,317 peaks to reach 80% of its signal. The smallest peak kept contributed 0.0118% - anything contributing less was filtered as noise.

Example: UR (Root tissue from tree U) only needed 154 peaks because its signal is concentrated in fewer compounds. Only peaks contributing at least 0.0727% made the cut.
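
The leaf and root averages quoted in the summary can be recomputed from the Peaks Needed values in the two tables. This is a minimal sketch with the values typed in by hand from the tables; the variable names are illustrative only.

```python
from statistics import mean

# "Peaks Needed" values copied from the leaf table (AL through UL).
leaf_peaks = [1317, 547, 692, 1317, 664, 1303, 756, 1289, 692, 568, 1217,
              676, 598, 1314, 747, 661, 1376, 614, 1378, 519, 1143]

# "Peaks Needed" values copied from the root table (AR through UR).
root_peaks = [442, 276, 368, 448, 515, 334, 433, 172, 372, 478,
              264, 244, 496, 302, 225, 126, 428, 400, 426, 154]

leaf_avg = mean(leaf_peaks)
root_avg = mean(root_peaks)

print(f"Leaf: {len(leaf_peaks)} samples, average {leaf_avg:.0f} peaks")
print(f"Root: {len(root_peaks)} samples, average {root_avg:.0f} peaks")
print(f"Leaf/root ratio: {leaf_avg / root_avg:.1f}")
```

Rounded to whole peaks, the averages come out to 923 for leaf and 345 for root, matching the figures reported in the summary.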

Filtering Methods Explained

Method 1: Cumulative Percentage (80% / 90%)

The requested approach for filtering noise from metabolomics data.

How it works (for each sample separately):
1. Take all peaks and their area values for one sample
2. Sort peaks from LARGEST to SMALLEST
3. Add up the areas as you go down the list
4. Stop as soon as the running total reaches 80% (or 90%) of the sample's total area
5. Keep every peak down to and including the one that crossed the threshold; filter out the rest

Important: A compound is kept if it makes the cut in ANY sample.
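
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used to produce the results: each sample is assumed to be a dictionary mapping peak name to area, and the function names, peak names, and toy values are all hypothetical.

```python
def cumulative_keep(areas, threshold=0.80):
    """Return the peaks that make up `threshold` of one sample's total area."""
    total = sum(areas.values())
    kept, running = set(), 0.0
    # Walk the peaks from largest to smallest area.
    for name, area in sorted(areas.items(), key=lambda kv: kv[1], reverse=True):
        kept.add(name)
        running += area
        # The peak that crosses the threshold is kept, then we stop.
        if running >= threshold * total:
            break
    return kept


def cumulative_filter(samples, threshold=0.80):
    """A peak survives if it makes the cut in ANY sample (union across samples)."""
    kept = set()
    for areas in samples.values():
        kept |= cumulative_keep(areas, threshold)
    return kept


# Hypothetical toy data: two samples measured on the same four peaks.
samples = {
    "AL": {"p1": 50.0, "p2": 35.0, "p3": 10.0, "p4": 5.0},
    "AR": {"p1": 3.0, "p2": 2.0, "p3": 10.0, "p4": 85.0},
}
kept = cumulative_filter(samples)
print(sorted(kept))
```

In the toy data, p1 and p2 make the cut in sample AL and p4 makes the cut in sample AR, so all three are kept even though none of them passes in both samples.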

Method 2: Minimum Percentage (≥ 0.01%)

How it works (for each sample separately):
1. Take all peaks and their area values for one sample
2. Add up all areas to get the total
3. For each peak: what percentage of the total is this?
4. If it's at least 0.01%, keep it
Key difference: No rank-based cutoff. Whether a peak is kept depends only on its own share of the sample's signal (≥0.01%), not on how many larger peaks come before it.
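
A minimal sketch of the minimum-percentage filter, again assuming each sample is a dictionary of peak name to area. The function names and toy values are hypothetical, and the union across samples follows the "in any tissue sample" rule used for the ≥0.01% results.

```python
def min_percent_keep(areas, min_pct=0.01):
    """Return the peaks contributing at least `min_pct` percent of one sample's total area."""
    total = sum(areas.values())
    return {name for name, area in areas.items() if 100.0 * area / total >= min_pct}


def min_percent_filter(samples, min_pct=0.01):
    """A peak survives if it reaches `min_pct` percent in ANY sample."""
    kept = set()
    for areas in samples.values():
        kept |= min_percent_keep(areas, min_pct)
    return kept


# Hypothetical toy data: p3 is 0.005% of sample A but 0.1% of sample B.
samples = {
    "A": {"p1": 9000.0, "p2": 999.5, "p3": 0.5},
    "B": {"p1": 5000.0, "p2": 4990.0, "p3": 10.0},
}
kept_in_a = min_percent_keep(samples["A"])
kept_overall = min_percent_filter(samples)
print(sorted(kept_in_a), sorted(kept_overall))
```

Unlike the cumulative method, no sorting is needed: each peak is judged on its own percentage, so two peaks with nearly identical contributions always receive the same treatment within a sample.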

The Arbitrary Cutoff Problem

The algorithm keeps adding peaks until the cumulative sum crosses 80%. The last peak kept is the one that pushes you over the threshold:

Peak #   % Contribution   Cumulative   Status
1314     0.01182%         79.9527%     KEPT
1315     0.01181%         79.9645%     KEPT
1316     0.01180%         79.9763%     KEPT
1317     0.01179%         79.9881%     KEPT
1318     0.01178%         79.9999%     KEPT
1319     0.01177%         80.0117%     KEPT ← crossed 80%
1320     0.01176%         80.0234%     FILTERED
1321     0.01175%         80.0352%     FILTERED

Peak 1319 is the last one kept because it's the peak that pushed the cumulative total past 80%. Peak 1320 contributes almost the same amount (0.01176% vs 0.01177%) but is filtered because we already crossed the threshold. This is the "arbitrary cutoff" - two nearly identical peaks get different treatment based on where 80% happened to fall.
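
The table can be replayed in a few lines of Python to show where the cut falls. This sketch uses only the contribution values listed in the table and derives the starting point from the first cumulative value; the variable names are illustrative.

```python
# (peak number, % contribution) copied from the table.
rows = [(1314, 0.01182), (1315, 0.01181), (1316, 0.01180), (1317, 0.01179),
        (1318, 0.01178), (1319, 0.01177), (1320, 0.01176), (1321, 0.01175)]

# Running total just before peak 1314, derived from its cumulative value.
cumulative = 79.9527 - 0.01182

statuses = {}
crossed = False
for peak, pct in rows:
    cumulative += pct
    # Once an earlier peak has crossed 80%, everything after it is filtered.
    statuses[peak] = "FILTERED" if crossed else "KEPT"
    if cumulative >= 80.0:
        crossed = True

last_kept = max(peak for peak, status in statuses.items() if status == "KEPT")
print(last_kept, statuses[1320])
```

The difference in contribution between the last kept peak (1319) and the first filtered peak (1320) is 0.00001 percentage points, which is the point of the example: the decision is driven by rank, not by any meaningful difference in signal.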

80% Cumulative Filtered Data

Peaks that contribute to 80% of the signal in at least one tissue sample. This is the requested filtering threshold.

Peaks Kept

3,740 Peaks
16.2% of original

Filtered Out

19,394 Peaks
83.8% removed as noise

90% Cumulative Filtered Data

More conservative threshold - keeps peaks that contribute to 90% of signal. Use this if 80% seems too aggressive.

Peaks Kept

6,777 Peaks
29.3% of original

Filtered Out

16,357 Peaks
70.7% removed as noise

≥ 0.01% Threshold Filtered Data

Alternative approach - keeps any peak contributing at least 0.01% of signal in any tissue sample. Avoids arbitrary rank-based cutoffs.

Peaks Kept

6,208 Peaks
26.8% of original

Filtered Out

16,926 Peaks
73.2% removed as noise
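
The kept and removed figures for all three filters follow directly from the 23,134 starting peaks. A short sketch of the arithmetic, using only the counts reported above (the dictionary layout is illustrative):

```python
ORIGINAL_PEAKS = 23134

# Peaks kept by each filter, as reported in the sections above.
kept_counts = {
    "80% cumulative": 3740,
    "90% cumulative": 6777,
    "≥0.01% minimum": 6208,
}

summary = {}
for method, kept in kept_counts.items():
    removed = ORIGINAL_PEAKS - kept
    summary[method] = {
        "kept": kept,
        "removed": removed,
        "kept_pct": round(100 * kept / ORIGINAL_PEAKS, 1),
        "removed_pct": round(100 * removed / ORIGINAL_PEAKS, 1),
    }

for method, row in summary.items():
    print(f"{method}: kept {row['kept']:,} ({row['kept_pct']}%), "
          f"removed {row['removed']:,} ({row['removed_pct']}%)")
```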

What Do These Peak Names Mean?

Each compound is identified by a code like 3.90_564.1489n. This encodes two measurements:

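Part            Example     Meaning
First number    3.90        Retention time (minutes) - how long it took to pass through the column
Second number   564.1489    Detected mass - a neutral mass or a mass-to-charge ratio (m/z), depending on the suffix
Suffix          n or m/z    Which kind of mass is reported - n typically marks a neutral mass, m/z a measured mass-to-charge ratio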
Important: These are NOT identified compounds. We know something with mass 564.1489 eluted at 3.90 minutes, but we don't know what molecule it is yet.
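
For downstream work it can help to split a peak name into its parts. The sketch below assumes every name follows the pattern shown above (retention time, underscore, mass, then a suffix of n or m/z); the function name parse_peak_name is hypothetical, and names in other formats would need a different pattern.

```python
import re

# <retention time>_<mass><suffix>, e.g. "3.90_564.1489n"
PEAK_NAME = re.compile(
    r"^(?P<rt>\d+(?:\.\d+)?)_(?P<mass>\d+(?:\.\d+)?)(?P<suffix>n|m/z)$"
)


def parse_peak_name(name):
    """Split a peak name into (retention time in minutes, mass, suffix)."""
    match = PEAK_NAME.match(name)
    if match is None:
        raise ValueError(f"Unrecognized peak name: {name!r}")
    return float(match["rt"]), float(match["mass"]), match["suffix"]


rt, mass, suffix = parse_peak_name("3.90_564.1489n")
print(rt, mass, suffix)
```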

How to Identify Peaks

  1. Database search - Look up the mass in METLIN, HMDB, or MassBank
  2. Run standards - Buy a pure compound and check whether its retention time and mass match the peak
  3. MS/MS fragmentation - Break it apart and look at the pieces
  4. Literature - Check what others found in similar plants
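
Database searches by mass are usually run with a tolerance expressed in parts per million (ppm). The sketch below shows the arithmetic for the example mass 564.1489. The 5 ppm tolerance is an assumption chosen for illustration; the appropriate value depends on the instrument and is not stated in this analysis. The function names are hypothetical.

```python
def ppm_window(mass, ppm=5.0):
    """Return the (low, high) mass range within `ppm` parts per million of `mass`."""
    delta = mass * ppm / 1_000_000
    return mass - delta, mass + delta


def ppm_error(observed, reference):
    """Signed difference between an observed and a reference mass, in ppm."""
    return (observed - reference) / reference * 1_000_000


# Assumed tolerance of 5 ppm around the example mass from the peak name.
low, high = ppm_window(564.1489, ppm=5.0)
print(f"Search range: {low:.4f} to {high:.4f}")
```

A database candidate would count as a mass match if its reference mass falls inside this range. A mass match alone is not an identification, which is why the other steps in the list are still needed.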

What You Can Do Without Identification